[SPARK-22799][ML] Bucketizer should throw exception if single- and multi-column params are both set#19993
[SPARK-22799][ML] Bucketizer should throw exception if single- and multi-column params are both set#19993mgaido91 wants to merge 19 commits intoapache:masterfrom
Conversation
…lti-column params are both set
|
Test build #84970 has finished for PR 19993 at commit
|
|
Jenkins, retest this please |
|
Test build #84978 has finished for PR 19993 at commit
|
| if (isSet(inputCols) && isSet(inputCol) || isSet(inputCols) && isSet(outputCol) || | ||
| isSet(inputCol) && isSet(outputCols)) { | ||
| throw new IllegalArgumentException("Both `inputCol` and `inputCols` are set, `Bucketizer` " + | ||
| "only supports setting either `inputCol` or `inputCols`.") |
There was a problem hiding this comment.
Here it is better to add one more check:
if single column, only set splits param.
if multiple column, only set splitsArray param. and check:
inputCols.length == outputCols.length == splitsArray.length
There was a problem hiding this comment.
thanks. I will add these checks while doing the changes requested by @hhbyyh. Thanks.
hhbyyh
left a comment
There was a problem hiding this comment.
@mgaido91, Thanks for the PR.
Since we have several classes that need the improvement and more classes will support HasInputCols, I feel like we should develop some code that can be shared by other classes.
We also need common unit test methods for the case.
|
@hhbyyh thanks for the review. I see that for some classes there are ongoing PRs. Thus I cannot change them now in order to have a common place and a common test. Should I wait for those PRs to be merged then? Or were you suggesting something different? |
|
I would suggest to develop the common infrastructure and unit test first, then other PR can take it or we can send follow-up fix. cc @MLnick for advice. |
| @@ -140,10 +140,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String | |||
| * by `inputCol`. A warning will be printed if both are set. | |||
There was a problem hiding this comment.
We should modify the document above too.
|
@hhbyyh I created a common infrastructure. Please let me know if I have to modify it. Meanwhile, I'd like to discuss where to put the common UTs: do you have any specific idea about the right place? Thanks. |
|
Test build #85123 has finished for PR 19993 at commit
|
|
Test build #85124 has finished for PR 19993 at commit
|
|
Test build #85128 has finished for PR 19993 at commit
|
|
Test build #85130 has finished for PR 19993 at commit
|
| } else { | ||
| false | ||
| ParamValidators.assertColOrCols(this) | ||
| if (isSet(inputCol) && isSet(splitsArray)) { |
There was a problem hiding this comment.
I noticed isBucketizeMultipleColumns is invoked in many places and maybe we can put the checks in other places like transformSchema. It also makes the code consistent with function name.
| model match { | ||
| case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) => | ||
| raiseIncompatibleParamsException("inputCols", "inputCol") | ||
| case m: HasOutputCols with HasInputCol if m.isSet(m.outputCols) && m.isSet(m.inputCol) => |
There was a problem hiding this comment.
This may not necessarily be an error for some classes, but we can keep it for now.
| * this is not true, an `IllegalArgumentException` is raised. | ||
| * @param model | ||
| */ | ||
| def assertColOrCols(model: Params): Unit = { |
| } | ||
| } | ||
|
|
||
| def raiseIncompatibleParamsException(paramName1: String, paramName2: String): Unit = { |
| } | ||
|
|
||
| def raiseIncompatibleParamsException(paramName1: String, paramName2: String): Unit = { | ||
| throw new IllegalArgumentException(s"Both `$paramName1` and `$paramName2` are set.") |
There was a problem hiding this comment.
Error message can be more straight forward. e.g. $paramName1 and $paramName2 cannot be set simultaneously.
|
Test build #85133 has finished for PR 19993 at commit
|
|
@hhbyyh thanks for the comments. I already fixed all your comments, but I am waiting to push for the UT. Honestly I think that |
|
To make it available for other classes, we need to support checking for both |
|
thanks @hhbyyh, I updated the PR according to your suggestion and previous comments. |
|
Test build #85209 has finished for PR 19993 at commit
|
hhbyyh
left a comment
There was a problem hiding this comment.
Thanks. Add some suggestions for function names and test utility.
| * this is not true, an `IllegalArgumentException` is raised. | ||
| * @param model | ||
| */ | ||
| private[spark] def assertColOrCols(model: Params): Unit = { |
There was a problem hiding this comment.
suggestion for function name:
assertColOrCols --> checkMultiColumnParams
| * @param paramsClass The Class to be checked | ||
| * @param spark A `SparkSession` instance to use | ||
| */ | ||
| def checkMultiColumnParams(paramsClass: Class[_ <: Params], spark: SparkSession): Unit = { |
There was a problem hiding this comment.
suggestion for function name:
checkMultiColumnParams --> testMultiColumnParams
| // create fake input Dataset | ||
| val feature1 = Array(-1.0, 0.0, 1.0) | ||
| val feature2 = Array(1.0, 0.0, -1.0) | ||
| val df = feature1.zip(feature2).toSeq.toDF("feature1", "feature2") |
There was a problem hiding this comment.
I don't think the DataFrame here can be used for other transformers like StringIndexer.
How about add a Dataset as function parameter? And is it possible to use an instance obj: Params rather than paramsClass: Class[_ <: Params] as parameter, just to be more flexible.
There was a problem hiding this comment.
The reason why I created the dataframe inside the method was to control the names of the columns it has. Otherwise we can't ensure that those columns exist. I think that the type check is performed later, thus it is not a problem here. What do you think?
I preferred to use paramsClass: Class[_ <: Params] because I need a clean instance for each of the two checks: if an instance is passed I cannot enforce that it is clean, ie. some parameters weren't already set and I would need to copy it to create new instances as well, since otherwise the second check would be influenced by the first one. What do you think?
Thanks.
There was a problem hiding this comment.
We can send column names as parameter if necessary. We need to ensure the test utility can be used by most transformers with multiple column support.
I think that the type check is performed later, thus it is not a problem here.
I don't quite get it, in either transform or fit the data type will be checked and they will trigger exceptions.
I preferred to use paramsClass: Class[_ <: Params]
I'm only thinking about the case that default constructor is not sufficient to create a working Estimator/Transformer. If that's not a concern, then reflection is OK.
|
Test build #85253 has finished for PR 19993 at commit
|
|
Test build #86282 has finished for PR 19993 at commit
|
|
Since RC1 for 2.3 failed, it'd be great to get this into 2.3. @mgaido91 do you mind if I send my comments along with a PR to update this PR of yours? I'm rushing because of the time pressure to get this into 2.3 (to avoid a behavior change between 2.3 and 2.4). Thanks in advance! |
| @Since("1.4.0") | ||
| override def transformSchema(schema: StructType): StructType = { | ||
| if (isBucketizeMultipleColumns()) { | ||
| ParamValidators.checkExclusiveParams(this, "inputCol", "inputCols") |
There was a problem hiding this comment.
The problem with trying to use a general method like this is that it's hard to capture model-specific requirements. This currently misses checking to make sure that exactly one (not just <= 1) of each pair is available, plus that all of the single-column OR all of the multi-column Params are available. (The same issue occurs in #20146 ) It will also be hard to check these items and account for defaults.
I'd argue that it's not worth trying to use generic checking functions here.
There was a problem hiding this comment.
my initial implementation (with @hhbyyh's comments) was more generic and checked what you said. After, @MLnick and @viirya asked to switch to a more generic approach which is the current you see. I'm fine with either of those, but I think we need to choose one way and go in that direction, otherwise we just loose time.
There was a problem hiding this comment.
I see. I'll see if I can come up with something which is generic but handles these other checks.
| @DeveloperApi | ||
| object ParamValidators { | ||
|
|
||
| private val LOGGER = LoggerFactory.getLogger(ParamValidators.getClass) |
There was a problem hiding this comment.
Let's switch this to use the Logging trait, to match other MLlib patterns.
|
@jkbradley sure no problem, let me know how I can help. |
|
OK, sent: mgaido91#1 |
strengthened requirements about exclusive Params for single and multicolumn support
|
Test build #86416 has finished for PR 19993 at commit
|
|
Test build #86437 has finished for PR 19993 at commit
|
|
Test build #86439 has finished for PR 19993 at commit
|
| ("inputCols", Array("feature1", "feature2"))) | ||
| ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "result1"), | ||
| ("outputCols", Array("result1", "result2"))) | ||
| ParamsSuite.testExclusiveParams(new Bucketizer, df, ("splits", Array(-0.5, 0.0, 0.5)), |
There was a problem hiding this comment.
Only comment I have is that I believe this line is not testing what you may think.
As I read the checkSingleVsMultiColumnParams method, in this test case it will throw the error, not because both splits and splitsArray are set, but rather because both inputCol & inputCols are unset.
Actually it applies to the line above too.
There was a problem hiding this comment.
@MLnick actually it will fail for both reasons. We can add more test cases to check each of these two cases if you think it is needed.
|
Overall looks good with @jkbradley's changes. I just left a comment on the param test cases as I think they're not quite complete |
|
Well yes it would - but the method checks inputCols/inputCol first so will
always fail for that reason here, ie we aren’t actually testing the full
code path
…On Mon, 22 Jan 2018 at 16:43, Marco Gaido ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala
<#19993 (comment)>:
> - test("Both inputCol and inputCols are set") {
- val bucket = new Bucketizer()
- .setInputCol("feature1")
- .setOutputCol("result")
- .setSplits(Array(-0.5, 0.0, 0.5))
- .setInputCols(Array("feature1", "feature2"))
-
- // When both are set, we ignore `inputCols` and just map the column specified by `inputCol`.
- assert(bucket.isBucketizeMultipleColumns() == false)
+ test("assert exception is thrown if both multi-column and single-column params are set") {
+ val df = Seq((0.5, 0.3), (0.5, -0.4)).toDF("feature1", "feature2")
+ ParamsSuite.testExclusiveParams(new Bucketizer, df, ("inputCol", "feature1"),
+ ("inputCols", Array("feature1", "feature2")))
+ ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "result1"),
+ ("outputCols", Array("result1", "result2")))
+ ParamsSuite.testExclusiveParams(new Bucketizer, df, ("splits", Array(-0.5, 0.0, 0.5)),
@MLnick <https://github.com/mlnick> actually it will fail for both
reasons. We can add more test cases to check each of these two cases if you
think it is needed.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19993 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA_SB1zkYZ5V4SOlLliOtxQ_6CCvoBm4ks5tNJ6egaJpZM4RD1b4>
.
|
|
Test build #86525 has finished for PR 19993 at commit
|
MLnick
left a comment
There was a problem hiding this comment.
A couple minor comments - but that can be left for a clean up in a follow up PR if need be.
I'd prefer to merge this to branch-2.3.
| import org.apache.spark.SparkFunSuite | ||
| import org.apache.spark.ml.{Estimator, Transformer} | ||
| import org.apache.spark.ml.linalg.{Vector, Vectors} | ||
| import org.apache.spark.ml.param.shared.{HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} |
There was a problem hiding this comment.
I don't think these are used any longer?
| ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "feature1"), | ||
| ("splits", Array(-0.5, 0.0, 0.5))) | ||
|
|
||
| // the following should fail because not all the params are set |
There was a problem hiding this comment.
Technically here we should probably also test the inputCols + outputCols case (i.e. that not setting splitsArray also throws an exception).
|
Test build #86589 has finished for PR 19993 at commit
|
|
+1 merge this to 2.3 |
…lti-column params are both set ## What changes were proposed in this pull request? Currently there is a mixed situation when both single- and multi-column are supported. In some cases exceptions are thrown, in others only a warning log is emitted. In this discussion https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049, the decision was to throw an exception. The PR throws an exception in `Bucketizer`, instead of logging a warning. ## How was this patch tested? modified UT Author: Marco Gaido <marcogaido91@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #19993 from mgaido91/SPARK-22799. (cherry picked from commit cd3956d) Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
|
Merged to master / branch-2.3 |
|
Thanks @mgaido91 and @jkbradley for working on this and others for review |
What changes were proposed in this pull request?
Currently there is a mixed situation when both single- and multi-column are supported. In some cases exceptions are thrown, in others only a warning log is emitted. In this discussion https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049, the decision was to throw an exception.
The PR throws an exception in
Bucketizer, instead of logging a warning.How was this patch tested?
modified UT